Software scheduled superscalar computer architecture.



A computing system is described in which
groups of individual instructions are executable 
in parallel by processing pipelines, and instruc- 
tions to be executed in parallel by different 
pipelines are supplied to the pipelines simul- 
taneously. During compilation of the instruc- 
tions those which can be executed in parallel 
are identified. The system includes a register for 
storing an arbitrary number of the instructions 
to be executed. The instructions to be executed 
are tagged with pipeline identification tags and 
group identification tags indicative of the 
pipeline to which they should be dispatched, 
and the group of instructions which may be 
dispatched during the same operation. The 
pipeline and group identification tags are used 
to dispatch the appropriate groups of instruc- 
tions simultaneously to the differing pipelines. 
BACKGROUND OF THE INVENTION

This invention relates to the architecture of computing
systems, and in particular to an architecture in which
groups of instructions may be executed in parallel, as
well as to methods and apparatus for accomplishing that.

A common goal in the design of computer architectures is
to increase the speed of execution of a given set of
instructions. Many solutions have been proposed for this
problem, and these solutions generally can be divided
into two groups.

According to a first approach, the speed of exe- 
cution of individual instructions is increased by using 
techniques directed to decreasing the time required to 
execute a group of instructions serially. Such techniques
include employing simple fixed-width instructions,
pipelined execution units, separate instruction and data
caches, increasing the clock rate of the instruction
processor, employing a reduced set of instructions, using
branch prediction techniques, and the like. As a result it
is now possible to reduce the number of clocks to execute
an instruction to approximately one. Thus, in these
approaches, the instruction execution rate is limited to
the clock speed for the system.

To push the limits of instruction execution to higher
levels, a second approach is to issue more than one
instruction per clock cycle, in other words, to issue
instructions in parallel. This allows the instruction
execution rate to exceed the clock rate. There are two
classical approaches to parallel execution of
instructions.

Computing systems that fetch and examine several
instructions simultaneously, to find parallelism in
existing instruction streams and determine if any can be
issued together, are known as superscalar computing
systems. In a conventional superscalar system, a small
number of independent instructions are issued in each
clock cycle. Techniques are provided, however, to prevent
more than one instruction from issuing if the instructions
fetched are dependent upon each other or do not meet
other special criteria. There is a high hardware overhead
associated with this hardware instruction scheduling
process. Typical superscalar machines include the Intel
i960CA, the IBM RIOS, the Intergraph Clipper C400, the
Motorola 88110, the Sun SuperSparc, the Hewlett-Packard
PA-RISC 7100, the DEC Alpha, and the Intel Pentium.

Many researchers have proposed techniques for
superscalar multiple instruction issue. Agerwala, T., and
J. Cocke [1987], "High Performance Reduced Instruction
Set Processors," IBM Tech. Rep. (March), proposed this
approach and coined the name "superscalar." IBM described
a computing system based on these ideas, and now
manufactures and sells that machine as the RS/6000
system. This system is capable of issuing up to four
instructions per clock and is described in "The IBM RISC
System/6000 Processor," IBM J. of Res. & Develop.
(January, 1990) 34:1.

The other classical approach to parallel instruction
execution is to employ a "wide-word" or "very long
instruction word" (VLIW) architecture. A VLIW machine
requires a new instruction set architecture with a
wide-word format. A VLIW format instruction is a long
fixed-width instruction that encodes multiple concurrent
operations. VLIW systems use multiple independent
functional units. Instead of issuing multiple independent
instructions to the units, a VLIW system combines the
multiple operations into one very long instruction. For
example, in a VLIW system, multiple integer operations,
floating point operations, and memory references may be
combined in a single "instruction." Each VLIW instruction
thus includes a set of fields, each of which is
interpreted and supplied to an appropriate functional
unit. Although the wide-word instructions are fetched and
executed sequentially, because each word controls the
entire breadth of the parallel execution hardware, highly
parallel operation results. Wide-word machines have the
advantage of scheduling parallel operation statically,
when the instructions are compiled. The fixed-width
instruction word and its parallel hardware, however, are
designed to fit the maximum parallelism that might be
available in the code, and most of the time far less
parallelism is available in the code. Thus for much of
the execution time, most of the instruction bandwidth and
the instruction memory are unused.

There is often a very limited amount of parallelism
available in a randomly chosen sequence of instructions,
especially if the functional units are pipelined. When
the units are pipelined, operations being issued on a
given clock cycle cannot depend upon the outcome of any
of the previously issued operations already in the
pipeline. Thus, to efficiently employ VLIW, many more
parallel operations are required than the number of
functional units.

Another disadvantage of VLIW architectures, which results
from the fixed number of slots in the very long
instruction word for classes of instructions, is that a
typical VLIW instruction will contain information in only
a few of its fields. This is inefficient, requiring the
system to be designed for a circumstance that occurs only
rarely: a fully populated instruction word.

Another disadvantage of VLIW systems is the need to
increase the amount of code. Whenever an instruction is
not full, the unused functional units translate to wasted
bits, no-ops, in the instruction coding. Thus useful
memory and/or instruction cache space is filled with
useless no-op instructions. In short, VLIW machines tend
to be wasteful of memory space and memory bandwidth
except for a very limited class of programs.

The term VLIW was coined by J. A. Fisher and his
colleagues in Fisher, J. A., J. R. Ellis, J. C.
Ruttenberg, and A. Nicolau [1984], "Parallel Processing: A
Smart Compiler and a Dumb Machine," Proc. SIGPLAN Conf.
on Compiler Construction (June), Palo Alto, CA, 11-16.
Such a machine was commercialized by Multiflow
Corporation.

For a more detailed description of both superscalar and
VLIW architectures, see Computer Architecture: A
Quantitative Approach, John L. Hennessy and David A.
Patterson, Morgan Kaufmann Publishers, 1990.

SUMMARY OF THE INVENTION 

We have developed a computing system architecture, which
we term software-scheduled superscalar, which enables
instructions to be executed both sequentially and in
parallel, yet without wasting space in the instruction
cache or registers. Like a wide-word machine, we provide
for static scheduling of concurrent operations at program
compilation. Instructions are also stored and loaded into
fixed width frames (equal to the width of a cache line).
Like a superscalar machine, however, we employ a
traditional instruction set, in which each instruction
encodes only one basic operation (load, store, etc.). We
achieve concurrence by fetching and dispatching "groups"
of simple individual instructions, arranged in any order.
The architecture of our invention relies upon the
compiler to assign instruction sequence codes to
individual instructions at the time they are compiled.
During execution these instruction sequence codes are
used to sort the instructions into appropriate groups and
execute them in the desired order. Thus our architecture
does not suffer the high hardware overhead and runtime
constraints of the superscalar strategy, nor does it
suffer the wasted instruction bandwidth and memory
typical of VLIW systems.

Our system includes a mechanism, an associative crossbar,
which routes in parallel each instruction in an
arbitrarily selected group to an appropriate pipeline, as
determined by a pipeline tag applied to that instruction
during compilation. Preferably, the pipeline tag will
correspond to the type of functional unit required for
execution of that instruction, e.g., floating point unit
1. All instructions in a selected group can be dispatched
simultaneously.

Thus, in one implementation, our system includes a cache
line, register, or other means for holding at least one
group of instructions to be executed in parallel, each
instruction in the group having associated therewith a
pipeline identifier indicative of the pipeline for
executing that instruction and a group identifier
indicative of the group of instructions to be executed in
parallel. The group identifier causes all instructions
having the same group identifier to be executed
simultaneously, while the pipeline identifier causes
individual instructions in the group to be supplied to an
appropriate pipeline.

In another embodiment the register holds multi- 
ple groups of instructions, and all of the instructions 
in each group having a common group identifier are 
placed next to each other, with the group of instruc- 
tions to be executed first placed at one end of the reg- 
ister, and the instructions in the group to be executed 
last placed at the other end of the register. 

In another embodiment of our invention a method of
executing arbitrary numbers of instructions in a stream
of instructions in parallel includes the steps of
compiling the instructions to determine which
instructions can be executed simultaneously, assigning
group identifiers to sets of instructions that can be
executed in parallel, determining a pipeline for
execution of each instruction, assigning a pipeline
identifier to each instruction, and placing the
instructions in a cache line or register for execution by
the pipelines.

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 is a block diagram illustrating a preferred
implementation of this invention;
Figure 2 is a diagram illustrating the data structure of
an instruction word in this system;
Figure 3 is a diagram illustrating a group of instruction
words;
Figure 4 is a diagram illustrating a frame containing
from one to eight groups of instructions;
Figure 5a illustrates the frame structure for one
maximum-sized group of eight instructions;
Figure 5b illustrates the frame structure for a typical
mix of three intermediate sized groups of instructions;
Figure 5c illustrates the frame structure for eight
minimum-sized groups, each of one instruction;
Figure 6 illustrates an instruction word after
predecoding;
Figure 7 illustrates the operation of the predecoder;
Figure 8 is a diagram illustrating the overall structure
of the instruction cache;
Figure 9 is a diagram illustrating the manner in which
frames are selected from the instruction cache;
Figure 10 is a diagram illustrating the group selection
function in the associative crossbar;
Figure 11 is a diagram illustrating the group dispatch
function in the associative crossbar;
Figure 12 is a diagram illustrating a hypothetical frame
of instructions;
Figure 13 is a diagram illustrating the manner in which
the groups of instructions in Figure 12 are issued on
different clock cycles;
Figure 14 is a diagram illustrating another embodiment of
the associative crossbar; and
Figure 15 is a diagram illustrating the group select
function in further detail.





DESCRIPTION OF THE SPECIFIC 
EMBODIMENTS 

Figure 1 is a block diagram of a computer system
according to the preferred embodiment of this invention.
Figure 1 illustrates the organization of the integrated
circuit chips by which the computing system is formed. As
depicted, the system includes a first integrated circuit
10 that includes a central processing unit, a floating
point unit, and an instruction cache.

In the preferred embodiment the instruction cache is a 16
kilobyte two-way set-associative 32 byte line cache. A
set associative cache is one in which the lines (or
blocks) can be placed only in a restricted set of
locations. The line is first mapped into a set, but can
be placed anywhere within that set. In a two-way set
associative cache, two sets, or compartments, are
provided, and each line can be placed in one compartment
or the other.

The system also includes a data cache chip 20 that
comprises a 32 kilobyte four-way set-associative 32 byte
line cache. The third chip 30 of the system includes a
predecoder, a cache controller, and a memory controller.
The predecoder and instruction cache are explained
further below. For the purposes of this invention, the
CPU, FPU, data cache, cache controller and memory
controller all may be considered of conventional design.

The communication paths among the chips are illustrated
by arrows in Figure 1. As shown, the CPU/FPU and
instruction cache chip communicates over a 32 bit wide
bus 12 with the predecoder chip 30. The asterisk is used
to indicate that these communications are multiplexed so
that a 64 bit word is communicated in two cycles. Chip 10
also receives information over 64 bit wide buses 14, 16
from the data cache 20, and supplies information to the
data cache 20 over three 32 bit wide buses 18.

The specific functions of the predecoder are described in
much greater detail below; however, essentially it
functions to decode a 32 bit instruction received from
the secondary cache into a 64 bit word, and to supply
that 64 bit word to the instruction cache on chip 10.

The cache controller on chip 30 is activated whenever a
first level cache miss occurs. Then the cache controller
either goes to main memory or to the secondary cache to
fetch the needed information. In the preferred embodiment
the secondary cache lines are 32 bytes and the cache has
an 8 kilobyte page size.

The data cache chip 20 communicates with the cache
controller chip 30 over another 32 bit wide bus. In
addition, the cache controller chip 30 communicates over
a 64 bit wide bus 32 with the DRAM memory, over a 128 bit
wide bus 34 with a secondary cache, and over a 64 bit
wide bus 36 to input/output devices.

As will be described, the system shown in Figure 1
includes both conventional and novel features. The system
includes multiple pipelines able to operate in parallel
on separate instructions. The instructions that can be
dispatched to these parallel pipelines simultaneously, in
what we term "instruction groups," have been identified
by the compiler and tagged with a group identification
tag. Thus, the group tag designates instructions that can
be executed simultaneously. Instructions within the group
are also tagged with a pipeline tag indicative of the
specific pipeline to which that instruction should be
dispatched. This operation is also performed by the
compiler.

In this system, each group of instructions can 
contain an arbitrary number of instructions ordered in 
an arbitrary sequence. The only limitation is that all 
instructions in the group must be capable of simulta- 
neous execution; e.g., there cannot be data depend- 
ency between instructions. The instruction groups 
are collected into larger sets and are organized into 
fixed width "frames" and stored. Each frame can con- 
tain a variable number of tightly packed instruction 
groups, depending upon the number of instructions in
each group and on the width of the frame. 

Below we describe this concept more fully, as 
well as describe a mechanism to route in parallel each 
instruction in an arbitrarily selected group to its ap- 
propriate pipeline, as determined by the pipeline tag 
of the instruction. 

In the following description of the word, group, 
and frame concepts mentioned above, specific bit 
and byte widths are used for the word, group and 
frame. It should be appreciated that these widths are 
arbitrary, and can be varied as desired. None of the 
general mechanisms described for achieving the re- 
sult of this invention depends upon the specific imple- 
mentation. 

In one embodiment of this system the central processing
unit includes eight functional units and is capable of
executing eight instructions in parallel. We designate
these pipelines using the digits 0 to 7. Also, for this
explanation each instruction word is 32 bits (4 bytes)
long, with a bit, for example the high order bit S, being
reserved as a flag for group identification. Figure 2
therefore shows the general format of all instructions.
As shown by Figure 2, bits 0 to 30 represent the
instruction, with the high order bit 31 reserved to flag
groups of instructions, i.e., collections of instructions
the compiler has determined may be executed in parallel.

Figure 3 illustrates a group of instructions. A group of
instructions consists of one to eight instructions
(because there are eight pipelines in the preferred
implementation) ordered in any arbitrary sequence, each
of which can be dispatched to a different parallel
pipeline simultaneously.

Figure 4 illustrates the structure of an instruction
frame. In the preferred embodiment an instruction frame
is 32 bytes wide and can contain up to eight instruction
groups, each comprising from one to eight instructions.
This is explained further below.

When the instruction stream is compiled before execution,
the compiler places instructions in the same group next
to each other in any order within the group and then
places that group in the frame. The instruction groups
are ordered within the frame from left to right according
to their issue sequence. That is, of the groups of
instructions in the frame, the first group to issue is
placed in the leftmost position, the second group to
issue is placed in the next position to the right, etc.
Thus, the last group of instructions to issue within that
frame will be placed in the rightmost location in the
frame. As explained, the group affiliation of all
instructions in the same group is indicated by setting
the S bit (bit 31 in Figure 2) to the same value. This
value toggles back and forth from 0 to 1 to 0, etc.,
between adjacent groups to thereby identify the groups.
Thus, all instructions in the first group in a frame have
the S bit set to 0, all instructions in the second group
have the S bit set to 1, all instructions in the third
group have the S bit set to 0, etc., for all groups of
instructions in the frame.
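
The S bit layout just described can be illustrated with a short sketch. The Python function below is not part of the described system; the name `assign_s_bits` is hypothetical, and the sketch assumes that a frame is always filled with exactly eight instruction words.

```python
FRAME_WORDS = 8  # a 32 byte frame holds eight 4 byte instruction words


def assign_s_bits(group_sizes):
    """Return the S bit of each word in a frame.

    group_sizes lists the number of instructions in each group, in
    issue order (leftmost group first). Hypothetical helper.
    """
    if any(size < 1 for size in group_sizes):
        raise ValueError("each group needs at least one instruction")
    if sum(group_sizes) != FRAME_WORDS:
        # Assumption of this sketch: the frame is completely filled.
        raise ValueError("group sizes must add up to eight words")
    s_bits = []
    for index, size in enumerate(group_sizes):
        # The S bit toggles 0, 1, 0, ... between adjacent groups.
        s_bits.extend([index % 2] * size)
    return s_bits
```

For a frame holding groups of two, three and three instructions, `assign_s_bits([2, 3, 3])` yields `[0, 0, 1, 1, 1, 0, 0, 0]`; eight single-instruction groups yield the alternating pattern.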

To clarify the use of a frame, Figure 5 illustrates three
different frame structures for different hypothetical
groups of instructions. In Figure 5a the frame structure
for a group of eight instructions, all of which can be
issued simultaneously, is shown. The instruction words
are designated W0, W1, ..., W7. The S bit for each one of
the instruction words has been set to 0 by the compiler,
thereby indicating that all eight instructions can be
issued simultaneously.

Figure 5b illustrates the frame structure for a typical
mixture of three intermediate sized groups of
instructions. In Figure 5b these three groups of
instructions are designated Group 0, Group 1 and Group 2.
Shown at the left-hand side of Figure 5b is Group 0,
which consists of two instruction words W0 and W1. The S
bit for each of these instructions has been set to 0.
Group 1 of instructions consists of three instruction
words, W2, W3 and W4, each having the S bit set to 1.
Finally, Group 2 consists of three instruction words, W5,
W6 and W7, each having its S bit set to 0.

Figure 5c illustrates the frame structure for eight 
minimum sized groups, each consisting of a single in- 
struction. Because each "group" of a single instruc- 
tion must be issued before the next group, the S bits 
toggle in a sequence 01010101 as shown.

As briefly mentioned above, in the preferred embodiment
the group identifiers are associated with individual
instructions in a group during compilation. In the
preferred embodiment, this is achieved by compiling the
instructions to be executed using well-known compiler
technology. During the compilation, the instructions are
checked for data dependencies, dependence upon previous
branch instructions, or other conditions that preclude
their execution in parallel with other instructions.
These steps are performed using a well-known compiler.
The result of the compilation is a group identifier being
associated with each instruction. It is not necessary
that the group identifier be added to the instruction as
a tag, as shown in the preferred embodiment and described
further below. In an alternative approach, the group
identifier is provided as a separate tag that is later
associated with the instruction. This makes possible the
execution of programs on our system without need to
revise the word width.
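
The description leaves the grouping step to well-known compiler technology. As one possible illustration only, the sketch below packs instructions into groups in program order and starts a new group whenever a register dependency is found. The `Instr` type, the `form_groups` function and the register names are all hypothetical, and the dependency test is deliberately conservative (it also breaks a group on write-after-read); branch dependence is not modelled.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]      # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]    # registers read


def form_groups(instrs, max_group=8):
    """Greedy in-order grouping; returns a list of lists of Instr."""
    groups, current = [], []
    written, read = set(), set()
    for ins in instrs:
        conflict = (
            any(src in written for src in ins.srcs)   # read-after-write
            or ins.dest in written                    # write-after-write
            or ins.dest in read                       # write-after-read
        )
        if current and (conflict or len(current) == max_group):
            groups.append(current)
            current, written, read = [], set(), set()
        current.append(ins)
        if ins.dest is not None:
            written.add(ins.dest)
        read.update(ins.srcs)
    if current:
        groups.append(current)
    return groups
```

Each resulting group could then be given the alternating S bit described earlier. A production compiler would typically also reorder instructions to build larger groups, which this sketch does not attempt.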

In addition, in some embodiments the compiler 
will determine the appropriate pipeline for execution 
of an individual instruction. This determination is es- 
sentially a determination of the type of instruction 
provided. For example, load instructions will be sent 
to the load pipeline, store instructions to the store
pipeline, etc. The association of the instruction with the
given pipeline can be achieved either by the compiler,
or by later examination of the instruction itself, for ex- 
ample during predecoding. 

Referring again to Figure 1, in normal operation 
the CPU will execute instructions from the instruction 
cache, according to well-known principles. On an in- 
struction cache miss, however, the entire frame con- 
taining the instruction missed is transferred from the 
main memory into the secondary cache and then into 
the primary instruction cache, or from the secondary
cache to the primary instruction cache, where it oc- 
cupies one line of the instruction cache memory. Be- 
cause instructions are only executed out of the in- 
struction cache, all instructions ultimately undergo 
the following procedure. 

At the time a frame is transferred into the instruction
cache, each instruction word in that frame is predecoded
by the predecoder 30 (Figure 1), which as is explained
below decodes the retrieved instruction into a full 64
bit word. As part of this predecoding the S bit of each
instruction is expanded to a full 3 bit field 000, 001,
..., 111, which provides the explicit binary group number
of the instruction. In other words, the predecoder, by
expanding the S bit to a three bit sequence, explicitly
provides information that the instruction group 000 must
execute before instruction group 010, although in both
groups all instructions have their S bits set to 0.
Because of the frame rules for sequencing groups, these
group numbers correspond to the order of issue of the
groups of instructions. Group 0 (000) will be issued
first, Group 1 (001), if present, will be issued second,
Group 2 (010) will be issued third. Ultimately, Group 7
(111), if present, will be issued last. At the time of
predecoding of each instruction, the S value of the last
word in the frame, which belongs to the last group in the
frame to issue, is stored in the tag field for that line
in the cache, along with the 19 bit real address and a
valid bit. The valid bit is a bit that specifies whether
the information in that line in the cache is valid. If
the bit is not set to "valid," there cannot be a match or
"hit" on this address. The S value from the last
instruction, which S value is stored in the tag field of
the line in the cache, provides a "countdown" value that
can be used to know when to increment to the next cache
line.
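
The expansion of the single S bit into an explicit group number amounts to counting the toggles from left to right across the frame. A minimal sketch follows; the function names `expand_group_numbers` and `frame_countdown` are hypothetical, and the input is simply the list of S bits of the words in one frame.

```python
def expand_group_numbers(s_bits):
    """Turn the per-word S bits of a frame into explicit group numbers."""
    numbers = []
    group = 0
    for position, bit in enumerate(s_bits):
        # A change of the S bit marks the start of the next group.
        if position > 0 and bit != s_bits[position - 1]:
            group += 1
        numbers.append(group)
    return numbers


def frame_countdown(s_bits):
    """Group number of the last word, stored with the line's tag."""
    return expand_group_numbers(s_bits)[-1]


def as_field(group_number):
    """Render a group number as the 3 bit S field, e.g. 2 -> '010'."""
    return format(group_number, "03b")
```

With S bits `[0, 0, 1, 1, 1, 0, 0, 0]` the first and third groups both carry S = 0, but they expand to the distinct fields `000` and `010`.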

As another part of the predecoding process, a new 4 bit
field prefix is added to each instruction giving the
explicit pipe number of the pipeline to which that
instruction will be routed. The use of four bits, rather
than three, allows the system to be later expanded with
additional pipelines. Thus, at the time an instruction is
supplied from the predecoder to the instruction cache,
each instruction will have the format shown in Figure 6.
As shown by Figure 6, bits 0 to 56 provide 57 bits for
the instruction, bits 57, 58 and 59 form the full 3 bit S
field, and bits 60-63 provide the 4 bit P field.
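
The 64 bit layout of the predecoded word can be expressed with shifts and masks. The sketch below only illustrates the bit positions given in the text; the helper names `pack_word` and `unpack_word` are hypothetical, and the 57 instruction bits are treated as an opaque integer because their encoding is not specified here.

```python
INSTR_MASK = (1 << 57) - 1   # bits 0 to 56: the instruction itself
S_SHIFT, S_MASK = 57, 0b111  # bits 57 to 59: 3 bit group (S) field
P_SHIFT, P_MASK = 60, 0b1111 # bits 60 to 63: 4 bit pipeline (P) field


def pack_word(instr, group, pipe):
    """Assemble a 64 bit predecoded word from its three fields."""
    if not 0 <= instr <= INSTR_MASK:
        raise ValueError("instruction does not fit in 57 bits")
    if not 0 <= group <= S_MASK:
        raise ValueError("group number does not fit in 3 bits")
    if not 0 <= pipe <= P_MASK:
        raise ValueError("pipeline number does not fit in 4 bits")
    return (pipe << P_SHIFT) | (group << S_SHIFT) | instr


def unpack_word(word):
    """Split a 64 bit predecoded word into (instr, group, pipe)."""
    return (
        word & INSTR_MASK,
        (word >> S_SHIFT) & S_MASK,
        (word >> P_SHIFT) & P_MASK,
    )
```

Packing and then unpacking returns the original three fields, and the largest values of all three fields together fill exactly 64 bits.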

Figure 7 illustrates the operation of the predecoder in
transferring a frame from memory to the instruction
cache. In the upper portion of Figure 7, the frame is
shown with a hypothetical four groups of instructions.
The first group consists of a single instruction, the
second group of three instructions, and each of the third
and fourth groups of two instructions. As described, each
instruction is 32 bits in length and includes an S bit to
separate the groups. The predecoder decodes the
instructions shown in the upper portion of Figure 7 into
the instructions shown in the lower portion of Figure 7.
As shown, the instructions are expanded to 64 bit length,
with each instruction including a 4 bit identification of
the pipeline to which the instruction is to be assigned,
and the expanded group field to designate the groups of
instructions that can be executed together. For
illustration, hypothetical pipeline tags have been
applied. Additionally, the predecoder examines each frame
for the minimum number of clocks required to execute the
frame, and that number is appended to the address tag 45
for the line. The address tag consists of bits provided
for the real address for the line, 1 bit to designate the
validity of the frame, and 3 bits to specify the minimum
time, in number of clock cycles, for that frame to issue.
The number of clocks for the frame to issue is determined
by the group identification number of the last word in
the frame. At this stage, the entire frame shown in the
lower portion of Figure 7 is present in the instruction
cache.

It may be desirable to implement the system of this
invention on computer systems that already are in
existence and therefore have instruction structures that
have already been defined without fields for the group
information, pipeline information, or both. In this case,
in another embodiment of this invention the group and
pipeline information is supplied on a different clock
cycle, then combined with the instructions in the cache.
Such an approach can be achieved by adding a "no-op"
instruction with fields that identify which instructions
are in which group, and identify the pipeline for
execution of the instruction, or by supplying the
information relating to the parallel instructions in
another manner. It therefore should be appreciated that
the manner in which the data arrives at the crossbar to
be processed is somewhat arbitrary. We use the word
"associated" herein to designate the concept that the
pipeline and group identifiers are not required to have a
fixed relationship to the instruction words. That is, the
pipeline and group identifiers need not be embedded
within the instructions themselves as shown in Figure 7.
Instead they may arrive from another means, or on a
different cycle.

Figure 8 is a simplified diagram illustrating the
secondary cache, the predecoder, and the instruction
cache. This drawing, as well as Figures 9, 10 and 11, is
used to explain the manner in which the instructions
tagged with the P and S fields are routed to their
designated instruction pipelines.

In Figure 8 instruction frames are fetched in a single
transfer across a 256 bit (32 byte) wide path from a
secondary cache 50 into the predecoder 60. As explained
above, the predecoder expands each 32 bit instruction in
the frame to its full 64 bit wide form and prefixes the P
and S fields. After predecoding, the 512 bit wide
instruction frame is transferred into the primary
instruction cache 70. At the same time, the tag is placed
into the tag field 74 for that line.

The instruction cache operates as a conventional 
physically-addressed instruction cache. In the example
depicted in Figure 8, the instruction cache will
contain 512 bit fully-expanded instruction frames of 
eight instructions each organized in two compart- 
ments of 256 lines. 

Address sources for the instruction cache arrive at a
multiplexer 80 that selects the next address to be
fetched. Because instructions are always machine words,
the low order two address bits <1:0> of the 32 bit
address field supplied to multiplexer 80 are discarded.
These two bits designate byte and half-word boundaries.
Of the remaining 30 bits, the next three low order
address bits <4:2>, which designate a particular
instruction word in a frame, are sent directly via bus 81
to the associative crossbar (explained in conjunction
with subsequent figures). The next low eight address bits
<12:5> are supplied over bus 82 to the instruction cache
70 where they are used to select one of the 256 lines in
the instruction cache. Finally, the remaining 19 bits of
the virtual address <31:13> are sent to the translation
lookaside buffer (TLB) 90. The TLB translates these bits
into the high 19 bits of the physical address. The TLB
then supplies them over bus 84 to the instruction cache.
In the cache they are compared with the tag of the
selected line, to determine if there is a "hit" or a
"miss" in the instruction cache.
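
The division of the 32 bit address into its four fields can be checked with a few shifts and masks. The function name `split_address` and the dictionary keys below are hypothetical labels; only the bit ranges come from the text.

```python
def split_address(addr):
    """Split a 32 bit address into the fields used for instruction fetch."""
    if not 0 <= addr < 2**32:
        raise ValueError("expected a 32 bit address")
    return {
        "byte_offset": addr & 0x3,           # bits <1:0>, discarded
        "word_in_frame": (addr >> 2) & 0x7,  # bits <4:2>, to the crossbar
        "line_index": (addr >> 5) & 0xFF,    # bits <12:5>, one of 256 lines
        "virtual_page": addr >> 13,          # bits <31:13>, to the TLB
    }
```

The field widths are 2 + 3 + 8 + 19 = 32 bits, so every address bit is accounted for, and the 8 bit line index matches the 256 lines per compartment.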

If there is a hit in the instruction cache, indicating
that the addressed instruction is present in the cache,
then the selected frame containing the addressed
instruction is transferred across the 512 bit wide bus 73
into the associative crossbar 100. The associative
crossbar 100 then dispatches the addressed instruction,
with the other instructions in its group, if any, to the
appropriate pipelines over buses 110, 111, ..., 117.
Preferably the bit lines from the memory cells containing
the bits of the instruction are themselves coupled to the
associative crossbar. This eliminates the need for
numerous sense amplifiers, and allows the crossbar to
operate on the lower voltage swing information from the
cache line directly, without the normally intervening
driver circuitry to slow system operation.

Figure 9 is a block diagram illustrating in more detail
the frame selection process. As shown, bits <4:2>
of the virtual address are supplied directly to the
associative crossbar 100 over bus 81. Bus 81, as
explained above, will preferably include a pair of
conductors, the bit lines, for each data bit in the field. Bits
<12:5> supplied over bus 82 are used to select a line
in the instruction cache. The remaining 19 bits,
translated into the 19 high order bits <31:13> of physical
address, are used to compare against the tags of the
two selected lines (one from each compartment of the
cache) to determine if there is a hit in either
compartment. If there is a hit, the two 512 bit wide frames are
supplied to multiplexer 120. The choice of which line
is ultimately supplied to associative crossbar 100
depends upon the real address bits <31:13> that are
compared by comparators 125. The output from
comparators 125 thus selects the appropriate frame for
transfer to the crossbar 100.
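
The two-compartment lookup of Figure 9 can be sketched in software as follows. This is only an illustration: the tuple layout for a cache line, the names, and the use of -1 as an "invalid" tag are assumptions, and the hardware performs both tag comparisons in parallel rather than in a loop.

```python
from typing import List, Optional, Tuple

FRAME_BYTES = 64  # a 512 bit wide frame
CacheLine = Tuple[int, bytes]  # (physical tag <31:13>, frame); hypothetical layout


def select_frame(compartments: List[List[CacheLine]],
                 line_index: int,
                 physical_tag: int) -> Optional[bytes]:
    """Return the frame whose tag matches, or None on a miss."""
    # Bits <12:5> select one line from each compartment.
    candidates = [compartment[line_index] for compartment in compartments]
    # The comparators check each selected tag against the translated address;
    # the matching compartment's frame is passed on (the role of multiplexer 120).
    for tag, frame in candidates:
        if tag == physical_tag:
            return frame
    return None


empty = (-1, bytes(FRAME_BYTES))  # -1 never matches a 19-bit tag
compartment_a = [empty] * 256
compartment_b = [empty] * 256
compartment_b[3] = (1, b"\x01" * FRAME_BYTES)

hit = select_frame([compartment_a, compartment_b], line_index=3, physical_tag=1)
miss = select_frame([compartment_a, compartment_b], line_index=3, physical_tag=2)
```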

Figure 10 illustrates in more detail the group select
function of the associative crossbar. A 512 bit
wide register 130, preferably formed by the SRAM
cells in the instruction cache, contains the frame of the
instructions to be issued. For the purposes of
illustration, register 130 is shown as containing a frame
having three groups of instructions, with Group 0
including words W0, W1 and W2; Group 1 containing words
W3, W4 and W5; and Group 2 containing words W6
and W7. For illustration, the instructions in Group 0
are to be dispatched to pipelines 1, 2 and 3; the
instructions in Group 1 to pipelines 1, 3 and 6; and the
instructions in Group 2 to pipelines 1 and 6. The three
S bits (group identification field) of each instruction
in the frame are brought out to an 8:1 multiplexer 140
over buses 131, 132, 133, ..., 138. The S field of the
next group of instructions to be executed is present
in a 3 bit register 145. As shown in Figure 10, the
hypothetical contents of register 145 are 011. These bits
have been loaded into register 145 using bus 81
described in conjunction with Figure 9. Multiplexer 140
then compares the value in this register against the
contents of the S field in each of the instruction
words. If the two values match, the appropriate
decoder 150 is enabled, permitting the instruction word
to be processed on that clock cycle. If the values do
not match, the decoder is disabled and the instruction
words are not processed on that clock cycle. In the
example depicted in Figure 10, the contents of register
145 match the S field of the Group 1 instructions. The
resulting output, supplied over bus 142, is
communicated to S register 144 and then to the decoders via
bus 148. The S register contents enable decoders
153, 154 and 155, all of which are in Group 001. As
will be shown in Figure 11, this will enable these
instructions W3, W4 and W5 to be sent to the pipelines
for processing.
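
The group select step can be sketched as below. This follows one reading of Figure 10, and that reading is an assumption: register 145 is taken to hold the word address delivered on bus 81 (011, i.e. word W3), multiplexer 140 is taken to pick out that word's S field, and every decoder whose word carries the same S field is enabled. The function and variable names are hypothetical.

```python
from typing import List


def enabled_decoders(s_fields: List[int], word_address: int) -> List[int]:
    """Return the indices of the decoders enabled on this clock cycle."""
    # Role assumed for multiplexer 140: select the S field of the addressed word.
    current_group = s_fields[word_address]
    # A decoder is enabled only if its word belongs to the current group.
    return [i for i, s in enumerate(s_fields) if s == current_group]


# S fields of W0..W7 for the example frame: Group 0, Group 1, Group 2.
S_FIELDS = [0b000, 0b000, 0b000, 0b001, 0b001, 0b001, 0b010, 0b010]

enabled = enabled_decoders(S_FIELDS, word_address=0b011)
# enabled == [3, 4, 5], corresponding to decoders 153, 154 and 155
```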

Figure 11 is a block diagram illustrating the group
dispatching of the instructions in the group to be
executed. The same registers are shown across the
upper portion of Figure 11 as in the lower portion of
Figure 10. As shown in Figure 11, the crossbar switch
itself consists of two sets of crossing pathways. In the
horizontal direction are the pipeline pathways 180,
181, ..., 187. In the vertical direction are the
instruction word paths 190, 191, ..., 197. Each of these
pipeline and instruction pathways is itself a bus
for transferring the instruction word. Each horizontal
pipeline pathway is coupled to a pipeline execution
unit 200, 201, 202, ..., 207. Each of the vertical
instruction word pathways 190, 191, ..., 197 is coupled
to an appropriate portion of register 130 (Figure 10).

The decoders 170, 171, ..., 177 associated with
each instruction word pathway receive the 4 bit
pipeline code from the instruction. Each decoder, for
example decoder 170, provides as output eight 1 bit
control lines. One of these control lines is associated with
each pipeline pathway crossing of that instruction
word pathway. Selection of a decoder as described
with reference to Figure 10 activates the output bit
control line corresponding to that input pipe number.
This signals the crossbar to close the switch between
the word path associated with that decoder and the
pipe path selected by that bit line. Establishing the
cross connection between these two pathways causes
a selected instruction word to flow into the selected
pipeline. For example, decoder 173 has received
the pipeline bits for word W3. Word W3 has associated
with it pipeline path 1. The pipeline path 1 bits are
decoded to activate switch 213 to supply instruction
word W3 to pipeline execution unit 201 over pipeline
path 181. In a similar manner, the identification of
pipeline path 3 for decoder D4 activates switch 234 to
supply instruction word W4 to pipeline path 3. Finally,
the identification of pipeline 6 for word W5 in decoder
D5 activates switch 265 to transfer instruction word
W5 to pipeline execution unit 206 over pipeline
pathway 186. Thus, instructions W3, W4 and W5 are
executed by pipes 201, 203 and 206, respectively.
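
The routing performed by the decoders and switches can be modeled roughly as follows. This is a sketch, not the circuit: `decode`, `route` and the list-of-tuples frame are hypothetical names and structures, and the pipeline codes for W0-W2 and W6-W7 are taken from the Figure 10 example.

```python
from typing import List, Optional, Tuple

NUM_PIPELINES = 8


def decode(pipeline_code: int) -> List[int]:
    """Model one decoder: eight 1 bit control lines, one per pipeline pathway."""
    return [1 if p == pipeline_code else 0 for p in range(NUM_PIPELINES)]


def route(frame: List[Tuple[str, int]], enabled: List[int]) -> List[Optional[str]]:
    """Close one switch per enabled word; return the word on each pipeline pathway."""
    pathways: List[Optional[str]] = [None] * NUM_PIPELINES
    for i in enabled:
        word, code = frame[i]
        pipe = decode(code).index(1)  # the single active control line
        if pathways[pipe] is not None:
            # Two words of one group must never target the same pipeline.
            raise ValueError(f"pipeline {pipe} selected twice in one group")
        pathways[pipe] = word
    return pathways


# (word, pipeline code) for W0..W7 as in the Figure 10 example.
FRAME = [("W0", 1), ("W1", 2), ("W2", 3),
         ("W3", 1), ("W4", 3), ("W5", 6),
         ("W6", 1), ("W7", 6)]

routed = route(FRAME, enabled=[3, 4, 5])
# routed == [None, "W3", None, "W4", None, None, "W5", None]
```

Because only the decoders of the current group are enabled, the same pipeline code can appear in several groups of a frame (pipeline 1 is used by W0, W3 and W6 here) without conflict.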

The pipeline processing units 200, 201, ..., 207
shown in Figure 11 can carry out desired operations.
In a preferred embodiment of the invention, each of
the eight pipelines first includes a sense amplifier to
detect the state of the signals on the bit lines. In one
embodiment the pipelines include first and second
arithmetic logic units; first and second floating point
units; first and second load units; a store unit and a
control unit. The particular pipeline to which a given
instruction word is dispatched will depend upon
hardware constraints as well as data dependencies.

Figure 12 is an example of a frame and how it will
be executed by the pipeline processors 200-207 of
Figure 11. As shown in Figure 12 the frame includes
three groups of instructions. The first group, with
group identification number 0, includes two
instructions that can be executed by the arithmetic logic unit,
a load instruction and a store instruction. Because all
these instructions have been assigned the same
group identification number by the compiler, all four
instructions can execute in parallel. The second
group of instructions consists of a single load
instruction and two floating point instructions. Again,
because each of these instructions has been assigned
"Group 1," all three instructions can be executed in
parallel. Finally, the last instruction word in the frame
is a branch instruction that, based upon the
compiler's decision, must be executed last.

Figure 13 illustrates the execution of the
instructions in the frame shown in Figure 12. As shown,
during the first clock the Group 0 instructions execute,
during the second clock the load and floating point
instructions execute, and during the third clock the
branch instruction executes. To prevent groups from
being split across two instruction frames, an
instruction frame may be only partially filled, where the last
group is too large to fit entirely within the remaining
space of the frame.
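
The clock-by-clock issue of Figures 12 and 13, and the rule that a group is never split across frames, can be illustrated with the sketch below. The instruction mnemonics, the function names and the packing routine are assumptions made for illustration; the specification does not give the compiler's packing algorithm.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

# (operation, group identification number) for the frame of Figure 12.
FRAME: List[Tuple[str, int]] = [
    ("ALU", 0), ("ALU", 0), ("load", 0), ("store", 0),
    ("load", 1), ("FP", 1), ("FP", 1),
    ("branch", 2),
]


def issue_schedule(frame: Sequence[Tuple[str, int]]) -> List[List[str]]:
    """One list of operations per clock; relies on group members being adjacent."""
    return [[op for op, _ in members]
            for _, members in groupby(frame, key=lambda entry: entry[1])]


def pack_groups(groups: Sequence[Sequence[str]], frame_size: int = 8) -> List[List[str]]:
    """Fill frames in order, starting a new frame when a group would not fit."""
    frames: List[List[str]] = []
    current: List[str] = []
    for group in groups:
        if len(group) > frame_size:
            raise ValueError("a group cannot exceed the frame size")
        if len(current) + len(group) > frame_size:
            frames.append(current)  # leave the rest of this frame unfilled
            current = []
        current = current + list(group)
    if current:
        frames.append(current)
    return frames


schedule = issue_schedule(FRAME)
# clock 1: ALU, ALU, load, store; clock 2: load, FP, FP; clock 3: branch
frames = pack_groups([["a", "b", "c", "d"], ["e", "f", "g"], ["h", "i", "j"]])
# the third group does not fit in the one remaining slot, so it opens a new frame
```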

Figure 14 is a diagram illustrating another
embodiment of the associative crossbar. In Figure 14
nine pipelines 0 - 8 are shown coupled to the crossbar.
The three bit program counter PC, which points to one
of the instructions in the frame, is used in combination
with the set of 8 group identification bits for the frame,
which indicate the group affiliation of each instruction,
to enable a subset of the instructions in the frame. The
enabled instructions are those at or above the
address indicated by the PC that belong to the current
group.

The execution ports that connect to the pipelines
specified by the pipeline identification bits of the
enabled instructions are then selected to multiplex out
the appropriate instructions from the current frame.
If one or more of the pipelines is not ready to receive
a new instruction, a set of hold latches at the output
of the execution ports prevents any of the enabled
instructions from issuing until the "busy" pipeline is
free. Otherwise the instructions pass transparently
through the hold latches into their respective
pipelines. Accompanying the output of each port is a "port
valid" signal that indicates whether the port has valid
information to issue to the hold latch.
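
The all-or-nothing behavior of the hold latches can be sketched as follows. The dictionary representation of the execution ports and the function names are hypothetical; only the nine pipelines and the "port valid" signal come from the description of Figure 14.

```python
from typing import Dict, List, Set

NUM_PIPELINES = 9  # pipelines 0 - 8 in Figure 14


def port_valid(ports: Dict[int, str]) -> List[bool]:
    """One "port valid" signal per execution port."""
    return [p in ports for p in range(NUM_PIPELINES)]


def try_issue(ports: Dict[int, str], busy: Set[int]) -> Dict[int, str]:
    """Issue the whole group, or nothing if any targeted pipeline is busy."""
    if any(pipe in busy for pipe in ports):
        return {}  # hold latches keep every enabled instruction back
    return dict(ports)  # latches are transparent; all instructions issue


group = {1: "W3", 3: "W4", 6: "W5"}
held = try_issue(group, busy={3})    # pipeline 3 is busy, so nothing issues
issued = try_issue(group, busy={5})  # pipeline 5 is not targeted, so all issue
```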

Figure 15 is a diagram illustrating the group
select function in further detail. This figure illustrates
the mechanism used to enable an addressed group of
instructions within a frame. The program counter is
first decoded into a set of 14 bit signals. Seven of
these signals are combined with the eight group
identifiers of the current frame to determine whether each
of the seven instructions, I1 to I7, is or is not the start
of a later group. This information can then be
combined with the other 7 bit signals from the PC decoder
to determine which of the eight instructions in the
frame should be enabled. Using the pipeline
identifying field each enabled instruction can be combined
with the other 7 bit signal to determine which of the
eight instructions in the frame should be enabled.
Each such enabled instruction can then signal the
execution port, as determined by the pipeline
identifier, to multiplex out the enabled instruction. Thus if
I2 is enabled, and the pipeline code is 5, the select line
from I2 to port 5 is activated, causing I2 to flow to the
hold latch at pipe 5.
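
The enable logic of Figure 15 can be approximated in software as below. The encoding of the eight group identification bits is not spelled out in the text, so this sketch assumes one bit per instruction, with a new group beginning wherever the bit differs from that of the preceding instruction. All names are hypothetical.

```python
from typing import List, Tuple


def group_starts(group_bits: List[int]) -> List[int]:
    """Indices among I1..I7 that begin a group (assumed: the bit toggles there)."""
    return [i for i in range(1, len(group_bits))
            if group_bits[i] != group_bits[i - 1]]


def enabled_instructions(pc: int, group_bits: List[int]) -> List[int]:
    """Instructions at or above the PC, up to the start of the next later group."""
    later = [i for i in group_starts(group_bits) if i > pc]
    end = later[0] if later else len(group_bits)
    return list(range(pc, end))


def select_lines(enabled: List[int], pipeline_codes: List[int]) -> List[Tuple[int, int]]:
    """(instruction, port) pairs whose select line is activated."""
    return [(i, pipeline_codes[i]) for i in enabled]


GROUP_BITS = [0, 0, 0, 1, 1, 1, 0, 0]      # groups I0-I2, I3-I5, I6-I7
PIPELINE_CODES = [1, 2, 5, 1, 3, 6, 1, 6]  # illustrative values only

enabled = enabled_instructions(pc=0, group_bits=GROUP_BITS)
lines = select_lines(enabled, PIPELINE_CODES)
# I2 carries pipeline code 5, so the select line from I2 to port 5 is active
```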

Because the instructions that start later groups
are known, the system can decide easily which
instruction starts the next group. This information is
used to update the PC to the address of the next
group of instructions. If no instruction in the frame
begins the next group, i.e., the last instruction group has
been dispatched to the pipelines, a flag is set. The
flag causes the next frame of instructions to be
brought into the crossbar. The PC is then reset to I0.
Shown in the figure is an exemplary sequence of the
values that the PC, the instruction enable bits and the
next frame flag take on over a sequence of eight
clocks extending over two frames.
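
The PC update and next frame flag can be sketched as follows, again assuming one group identification bit per instruction with a group boundary wherever the bit toggles. The example values are illustrative and are not the sequence shown in Figure 15.

```python
from typing import List, Tuple


def advance(pc: int, group_bits: List[int]) -> Tuple[int, bool]:
    """Return (next PC, next frame flag) after the group at `pc` is dispatched."""
    for i in range(pc + 1, len(group_bits)):
        if group_bits[i] != group_bits[i - 1]:
            return i, False  # instruction i starts the next group
    return 0, True  # no later group: fetch the next frame, reset the PC to I0


def trace_frame(group_bits: List[int]) -> List[Tuple[int, bool]]:
    """(PC, next frame flag) for each clock spent on one frame."""
    steps: List[Tuple[int, bool]] = []
    pc = 0
    while True:
        next_pc, next_frame = advance(pc, group_bits)
        steps.append((pc, next_frame))
        if next_frame:
            return steps
        pc = next_pc


steps = trace_frame([0, 0, 0, 1, 1, 1, 0, 0])
# three clocks: PC = 0, then 3, then 6 with the next frame flag set
```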

The processor architecture described above pro- 
vides many unique advantages to a system using this 
invention. The system described is extremely flexible, 
enabling instructions to be executed sequentially or 
in parallel, depending entirely upon the "intelligence" 
of the compiler. As compiler technology improves, the 
described hardware can execute programs more rap- 
idly, not being limited to any particular frame width, 
number of instructions capable of parallel execution, 
or other external constraints. Importantly, the asso- 
ciative crossbar aspect of this invention relies upon 
the content of the message being decoded, not upon 
an external control circuit acting independently of the 
instructions being executed. In essence, the associa- 
tive crossbar is self directed. In the preferred embodi- 
ment the system is capable of a parallel issue of up 
to eight operations per cycle. For a more complete
description of the associative crossbar, see copending
U.S. Application Serial No. 
filed , and entitled "Instruction
Cache Associative Crossbar Switch."

Although the foregoing has been a description of
the preferred embodiment of the invention, it will be
apparent to those of skill in the art that numerous
modifications and variations may be made to the
invention without departing from the scope as
described herein. For example, arbitrary numbers of
pipelines, arbitrary numbers of decoders, and different
architectures may be employed, yet rely upon the
system we have developed.

1. A computing system for executing groups of
individual instructions in parallel by processing
pipelines, the system comprising:

storage means for holding at least one
group of instructions to be executed in parallel,
each instruction in the group having associated
therewith a pipeline identifier indicative of the
pipeline for executing that instruction and a group
identifier indicative of the group of instructions to
be executed in parallel;

means responsive to the group identifier
for causing all instructions having the same
group identifier to be executed at the same time;
and

means responsive to the pipeline identifier
of the individual instructions in the group to
supply each instruction in the group to be executed
in parallel to an appropriate pipeline.

2. A computing system as in claim 1 wherein the
storage means includes the at least one group of
instructions, and for each instruction the storage
means includes the group identifier and the
pipeline identifier.

3. A computing system as in claim 2 wherein each 
instruction in the at least one group of instruc- 
tions has associated therewith a different pipe- 
line identifier. 

4. A computing system as in claim 1 wherein the
storage means holds at least two groups of
instructions, and all of the instructions in each
group having associated therewith a common
group identifier are placed adjacent to each other
in the storage means.

5. A computing system as in claim 4 wherein:

the storage means comprises a line in a
cache memory having a fixed number of storage
locations;

the group of instructions to be executed
first is placed at one end of the line in the cache
memory, and the instructions in the group to be
executed last are placed at the other end of the line
in the cache memory.

6. A method of executing arbitrary numbers of
instructions in a stream of instructions in parallel
which have been compiled to determine which
instructions can be executed at the same time, the
method comprising:

in response to the compilation assigning 
group identifiers to sets of instructions which can 
be executed in parallel; 

determining a pipeline for execution of 
each instruction in a group to be executed; 

assigning a pipeline identifier to each in- 
struction in the group; and 

placing the instructions in a register for 
execution by the pipelines. 

7. A method as in claim 6 further comprising the
step of executing a group of instructions in paral- 
lel. 

8. A method as in claim 7 wherein the register holds
at least two groups of instructions, and the step
of placing the instructions in the register for
execution by the pipelines comprises placing the
instructions in each group having associated
therewith a common group identifier adjacent to each
other in the register.

9. A method as in claim 8 wherein the step of executing a
group of instructions in parallel comprises
coupling the register to detection means to receive
the group identifier of each instruction in the
register and the group identifier of the next group of
instructions to be supplied to the pipelines; and

supplying only the instructions with the 
next group identifier to the pipeline execution 
units. 

10. In a computing system in which groups of individ- 
ual instructions are executable in parallel by 
processing pipelines, a method for supplying 
each instruction in a group to be executed in par- 
allel to an appropriate pipeline, the method com- 
prising: 

storing in storage an instruction frame, the
frame including at least one group of instructions
to be executed in parallel, each instruction in the
group having associated therewith a pipeline
identifier indicative of the pipeline which will
execute that instruction and a group identifier
indicative of the group identification;

comparing the group identifier of each in- 
struction in the instruction frame and a group 
identifier of those instructions to be next execut- 
ed in parallel; and 

using the pipeline identifier of those
instructions to be next executed in parallel to
control an execution unit to execute all of the
instructions in the group in separate pipelines.

11. In a computing system in which groups of
individual instructions are executable in parallel by
processing pipelines, apparatus for routing each
instruction in a group to be executed in parallel to
an appropriate pipeline, the apparatus
comprising:

storage for holding at least one group of
instructions to be executed in parallel, each
instruction in the group having associated
therewith a pipeline identifier indicative of the pipeline
for executing that instruction and a group
identifier to designate among the instructions present
in the storage those instructions which may be
simultaneously supplied to the processing
pipelines;

a crossbar having a first set of connectors
coupled to the storage for receiving instructions
therefrom and a second set of connectors
coupled to the processing pipelines;

means responsive to the pipeline identifier
of the individual instructions in the group for
routing individual instructions onto appropriate ones
of the second set of connectors, to thereby supply
each instruction in the group to be executed in
parallel to the appropriate pipeline.

12. Apparatus as in claim 11 wherein:

the first set of connectors consists of a set
of first communication buses, one for each
instruction in the storage;

the second set of connectors consists of a
set of second communication buses, one for each
pipeline; and

the means responsive to the pipeline
identifier comprises:

a set of decoders coupled to the
storage to receive as first input signals the
pipeline identifiers and in response thereto supply as
output signals a switch control signal; and

a set of switches, coupled to the
decoders, one switch at the intersection of each of
the first set of connectors with the second set of
connectors, the switches providing connections
in response to receiving the switch control signal
to thereby supply each instruction in the group to
be executed in parallel to the appropriate
pipeline.

13. Apparatus as in claim 12 further comprising:

detection means coupled to receive the
group identifier of each instruction in the storage
and connected to receive information regarding
the group identifier of the next group of
instructions to be supplied to the pipelines, and in
response thereto supply a group control signal; and

wherein the set of decoders coupled to the
storage are also coupled to the detection means
to receive the group control signal and in
response thereto supply a switch control signal
for only those instructions in the group to be
supplied to the pipelines.

14. Apparatus as in claim 13 wherein the detection
means comprises a multiplexer coupled to re- 
ceive each of the group identifiers of instructions 
in the storage and compare them to the informa- 
tion regarding the group identifier of the next 
group of instructions to be supplied to the pipe- 
lines. 

15. Apparatus as in claim 14 wherein the multiplexer
supplies an output signal to the decoders to indi- 
cate the group identifier of the next group of in- 
structions to be supplied to the pipelines. 

16. In a computing system in which groups of
individual instructions are executable in parallel by
processing pipelines, apparatus for routing each
instruction in a group to be executed in parallel to
an appropriate pipeline, the apparatus
comprising:

a storage for holding an instruction frame,
the frame including at least one group of
instructions to be executed in parallel, each instruction
in the group having associated therewith a
pipeline identifier indicative of the pipeline to which
that instruction is to be issued and a group
identifier indicative of the group identification;

a crossbar switch having a first set of
connectors coupled to the storage for receiving
instructions therefrom and a second set of
connectors coupled to the processing pipelines;

selection means connected to receive the
group identification of each instruction in the
instruction frame and connected to receive
information about the group identifier of those
instructions to be next executed in parallel for
supplying in response thereto an output signal
indicative of the next set of instructions to be
executed in parallel; and

decoder means coupled to receive the
output signal and each of the pipeline identifiers of
the instructions in the storage for selectively
connecting ones of the first set of connectors to ones
of the second set of connectors to thereby supply
each instruction in the group to be executed in
parallel to the appropriate pipeline.

17. Apparatus as in claim 16 wherein the first set of
connectors consists of a set of first
communication buses, one for each instruction in the
storage;

the second set of connectors consists of a 
set of second communication buses, one for each 
pipeline; 

the decoder means comprises a set of de-